Fast genotyping of known SNPs through approximate k-mer matching

Authors

  • Ariya Shajii
  • Deniz Yörükoglu
  • Yun William Yu
  • Bonnie Berger
Abstract

MOTIVATION: As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest, genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs), is important for characterization of known genetic traits and causative disease variants within an individual, as well as the initial stage of many ancestral and population genomic pipelines (e.g. GWAS).

RESULTS: We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci, which takes advantage of the fact that approximate matching of mid-size k-mers (with k = 32) can typically uniquely identify loci in the human genome without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as low as ∼5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays.

AVAILABILITY AND IMPLEMENTATION: LAVA software is available at http://lava.csail.mit.edu

CONTACT: [email protected]

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
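The core idea in the Results, identifying a SNP locus from a read by approximate k-mer matching instead of full alignment, can be illustrated with a small Python sketch. This is not LAVA's implementation: the abstract does not describe its data structures or calling model, so the function names (`build_index`, `lookup`, `count_alleles`, `call_genotype`), the Hamming-distance-1 tolerance, the per-read voting, and the allele-fraction thresholds are all assumptions made for illustration. The example also uses k = 8 on a toy sequence rather than k = 32 on a genome.

```python
from collections import defaultdict

BASES = "ACGT"

def build_index(snps, k):
    """Map every k-mer overlapping a SNP (for both alleles) to (snp_id, allele).

    snps: {snp_id: (left_flank, right_flank, ref_allele, alt_allele)} (hypothetical layout).
    """
    index = {}
    for snp_id, (left, right, ref, alt) in snps.items():
        pos = len(left)  # position of the SNP base in left + allele + right
        for allele in (ref, alt):
            seq = left + allele + right
            first = max(0, pos - k + 1)
            last = min(pos, len(seq) - k)
            for i in range(first, last + 1):
                index[seq[i:i + k]] = (snp_id, allele)
    return index

def neighbors(kmer):
    """Yield every k-mer at Hamming distance exactly 1 from kmer."""
    for i, base in enumerate(kmer):
        for b in BASES:
            if b != base:
                yield kmer[:i] + b + kmer[i + 1:]

def lookup(kmer, index):
    """Exact match first; otherwise accept a unique distance-1 match."""
    if kmer in index:
        return index[kmer]
    hits = {index[n] for n in neighbors(kmer) if n in index}
    return hits.pop() if len(hits) == 1 else None

def count_alleles(reads, index, k):
    """Each read casts one vote per SNP, only if it supports a single allele."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        support = defaultdict(set)  # snp_id -> alleles seen in this read
        for i in range(len(read) - k + 1):
            hit = lookup(read[i:i + k], index)
            if hit is not None:
                support[hit[0]].add(hit[1])
        for snp_id, alleles in support.items():
            if len(alleles) == 1:
                counts[snp_id][next(iter(alleles))] += 1
    return counts

def call_genotype(ref_count, alt_count, het_low=0.2, het_high=0.8):
    """Naive allele-fraction call; the thresholds are illustrative assumptions."""
    total = ref_count + alt_count
    if total == 0:
        return None
    frac = alt_count / total
    if frac < het_low:
        return "ref/ref"
    if frac > het_high:
        return "alt/alt"
    return "ref/alt"

# Toy usage: one SNP (A/G), one clean ref read, one alt read with a
# sequencing error away from the SNP, and one unrelated read.
snps = {"rs_demo": ("ACGTACGTTG", "CATGCCATGA", "A", "G")}
index = build_index(snps, k=8)
reads = ["ACGTTGACATGC", "ATGTTGGCATGC", "TTTTTTTTTTTT"]
counts = count_alleles(reads, index, k=8)
print(dict(counts["rs_demo"]), call_genotype(counts["rs_demo"]["A"], counts["rs_demo"]["G"]))
```

The exact-match-first rule matters here: the reference and alternate k-mers at a SNP differ by exactly one base, so a distance-1 search alone could not tell the two alleles apart. A real tool would also need to handle k-mers shared by several loci, which this sketch ignores.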

Similar resources

Multiplexing Schemes for Generic SNP Genotyping Assays

A generic genotyping assay utilizes a fixed set of reagents, which is independent of the actual target sample, to determine all present alleles. An example is the interrogation of several amplicons spanning polymorphic sites using an all k-mer array. Due to the high cost associated with a genotyping experiment, it is desirable to design a set of experiments, which maximizes the number of SNPs t...

High-Throughput SNP Genotyping by SBE/SBH (arXiv:cs/0512052v1 [cs.DS], 14 Dec 2005)

Despite much progress over the past decade, current Single Nucleotide Polymorphism (SNP) genotyping technologies still offer an insufficient degree of multiplexing when required to handle user-selected sets of SNPs. In this paper we propose a new genotyping assay architecture combining multiplexed solution-phase single-base extension (SBE) reactions with sequencing by hybridization (SBH) using ...

A Fast Algorithm for Approximate String Matching on Gene Sequences

Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the approximate string matching problem, the k-mismatch p...

Squeakr: an exact and approximate k-mer counting system

Motivation k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing data. These algorithms span the gamut of the analysis pipeline from k-mer counting (e.g. for estimating assembly parameters), to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations ...

Multiplex automated primer extension analysis: simultaneous genotyping of several polymorphisms.

Accurate and fast genotyping of single nucleotide polymorphisms (SNPs) is of significant scientific importance for linkage and association studies. We report here an automated fluorescent method we call multiplex automated primer extension analysis (MAPA) that can accurately genotype multiple known SNPs simultaneously. This is achieved by substantially improving a commercially available protoco...

Journal title:
  • Bioinformatics

Volume 32, Issue 17

Pages: -

Publication date: 2016